Nejaký text z čoho pozostavala praca alebo ustredna tema? Zadanie? kto je sponzor? Kroky?
The initial data was provided in a comma-separated values file, and was loaded and processed using the R programming language. Dataset used in this analysis contains 2 353 012 observations and 24 variables. Out of the 24 variables, 13 are of factor datatype, 9 are numeric and 2 are dates. All columns from the initial data were converted to the correct datype according to the data description file, which was provided. Column “payed_ammount” was replaced by column “paid_amount”. Column “payment_date” originally contained some blank fields, which were subsequently filled in as NA.
| Column name | Description | Type |
|---|---|---|
| contract_id | Unique identificator of the contract | Int |
| payment_order | Order of the payment | Int |
| due_date | Payment deadline | Date |
| payment_date | Date of the payment | Date |
| product_type | Type of the product | Factor |
| contract_status | Contract status | Factor |
| business_discount | Business discount provided | Factor |
| gender | Gender | Factor |
| marital_status | Marital status | Factor |
| number_of_children | Number of children | Int |
| number_other_product | Number of other products | Int |
| clients_phone | T/F if the client filled in home phone | Factor |
| client_mobile | T/F if the client filled in mobile phone | Factor |
| client_email | T/F if the client filled in email address | Factor |
| total_earnings | Earning bucket | Factor |
| birth_year | Birth year of the client | Int |
| birth_month | Birth month of the client | Int |
| living_area | Region of the client home address | Factor |
| different_contact_area | T/F if the client filled different home and contact address | Factor |
| kc_flag | T/F if the client does not home local citizenship | Factor |
| cf_val | If the special measure during the underwriting was applied | |
| kzmz_flag | T/F if the client filled in employer | Factor |
| due_amount | Installment what should be payed | Numeric |
| payed_amount | What was payed at a certain date | Numeric |
Frequency, relative frequency and relative cumulative frequency were computed for each category in all categorical variables. All frequency tables can be located in the “frequencies” list.
m’lady
We computed correlation coefficients between all possible pairs of numeric variables, see Figure ??, and discovered strong positive correlation between due amount and paid amount. This could be due to the fact that in the event that the installment has already been paid, the due amount and the paid amount would assume the same value. Correlation between the remaining pairs of numeric variables was either nonexistent or negligible. Then, the significance of correlation between due amount and paid amount was tested using Pearson’s product moment correlation coefficient. The pair of attributes was found to be significantly correlated with a correlation coefficient of 0.76 and p-value less than 2.2e-16.
Relationship between categorical variables was tested using chi-squared test with the significance level of 0.05. All significantly correlated pairs of variables can be accessed in “categorical_rel” dataframe.
NIECO HODIT
Next step consited of the data coverage analysis. As an example, we have chosen a couple of plots, that indicate interesting data distribution. We for example found out, that clients mostly order the product type 1, contracts are mostly in status number 1 or that most of the payments have a discount. We also discovered, that the marital status of the clients is mostly number 3 and they have most frequently no children. Clients also very frequently do not provide information about their earnings and they ordered usually 1 other product.
All the mentioned findings can be seen on visualizations on Figure ??. All the data coverage visualizations are available in the Attachments.
Exploring the NA values in the dataset, we found out, that 4 attributes had almost same percentage of missing values, as can be seen on the statistics on Figure @ref(fig:Missing_stat). Attributes kc_flag, living_area, cf_val and different_contact_area have the most missing values, almost 20 %, whereas payment_order has around 3,5 % and payment_date and delay have the same percentage, almost 0,5 %.
Using a different visualization, that can be seen on Figure @ref(fig:Missing_complex), we discovered, that the four attributes with the highest percentage are not missing at random but almost all at the same time.
We found out, that contract_id together with payment_order were not creating a unique key of the data payment, but one payment was divided into multiple parts, which was also causing problem with NA values in the four attributes. Data in the four attributes were not copied into other parts of a payment, but were present in just the first payment part. We decided to unify the payment parts into only one payment by summarizing the paid amount of all the parts and using the payment_date of the last paid part. Thanks to the unification, the amount of NA values has markedly decreased.
Secondly, we dealt with the NA values in payment_order and payment_date. Since it was only less then 4 % of the dataset, and it was not possible to substitute the values, we decided to delete the rows.
## # A tibble: 25 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 different_contact_area 471354 20.0
## 2 cf_val 469310 19.9
## 3 living_area 469015 19.9
## 4 kc_flag 468906 19.9
## 5 payment_order 83361 3.54
## 6 payment_date 11733 0.499
## 7 delay 11733 0.499
## 8 contract_id 0 0
## 9 due_date 0 0
## 10 product_type 0 0
## # ... with 15 more rows
(#fig:Missing_complex)Distribution of missing values
missing_values_plot
We decided to add new features to create higher-accuracy models. First of all, we computed a numerical feature “delay” counting the difference between payment_date and due_date.
Since we are creating two classification models deciding whether a new payment will be delayed for more than 21 days or more than 140 days, we created 2 new binary features “delay_21_y” and “delay_140_y”.
We also created a new numerical feature “delay_indiv” counting the mean delay for the whole client´s history. We also computed 2 new numerical features, “delay_indiv_21” and “delay_indiv_140” counting cumulative number of delayed payments (21, 140 days) by one contract
Lastly, numerical features “mean_delay_1m” ,“mean_delay_3m”, “mean_delay_6m”, “mean_delay_12m” are computing the mean delay for 1/3/6/12 months for each contract.
| Column name | Description | Type |
|---|---|---|
| delay | Difference between payment_date and due_date | Int |
| delay_21_y | T/F if the delay is more than 21 days | Factor |
| delay_140_y | T/F if the delay is more than 140 days | Int |
| delay_indiv | Mean delay for the whole client’s history | Int |
| delay_indiv_21 | Cumulative sum of payments delayed for more than 21 days by contract | Int |
| delay_indiv_140 | Cumulative sum of the payments delayed for more than 140 days by contract | Int |
| mean_delay_1m | Average payment delay for the last month | Int |
| mean_delay_3m | Average payment delay for the last 3 months | Int |
| mean_delay_6m | Average payment delay for the last 6 months | Int |
| mean_delay_12m | Average payment delay for the last 12 months | Int |
We computed basic statistics for each new attribute. All the visualizations can be seen on @ref(fig:stat_new_delay).
First of all we focused on attribute delay. Since it is probably the most important attribute of the nexly created, we analysed its distribution, distribution of delay on log scale DOPLNIT CO SME NASLI.(#fig:stat_new_delay-1)Basic statistics of the added attribute delay.
## [[1]]
(#fig:stat_new_delay-2)Basic statistics of the added attribute delay.
##
## [[2]]
## NULL
##
## [[3]]
(#fig:stat_new_delay-3)Basic statistics of the added attribute delay.
##
## [[4]]
(#fig:stat_new_delay-4)Basic statistics of the added attribute delay.
(#fig:stat_new)Basic statistics of the added attributes..
As can be seen on Figure, newly-created features also contain NA values. The highest percentage of missing values has attribute mean_delay_12m, almost 55 %. Together with mean_delay_6m, mean_delay_3m and mean_delay_1m, they are the only new attributes holding NA attributes.
It is not surprising, that these attributes have the highest percentage of NAs, since they compute results only every 12/6/3/1 months.
We decided to replace the NA values by 0, so they can be later used in the modeling part.
## # A tibble: 34 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 mean_delay_12m 804268 55.4
## 2 mean_delay_6m 458935 31.6
## 3 mean_delay_3m 249422 17.2
## 4 mean_delay_1m 95489 6.58
## 5 contract_id 0 0
## 6 payment_order 0 0
## 7 due_date 0 0
## 8 living_area 0 0
## 9 different_contact_area 0 0
## 10 kc_flag 0 0
## # ... with 24 more rows